On Generalization Bounds for Projective Clustering
Given a set of points, clustering consists of finding a partition of a point
set into $k$ clusters such that the center to which a point is assigned is as
close as possible. Most commonly, centers are points themselves, which leads to
the famous $k$-median and $k$-means objectives. One may also choose centers to
be $j$-dimensional subspaces, which gives rise to subspace clustering. In this
paper, we consider learning bounds for these problems. That is, given a set of
$n$ samples $P$ drawn independently from some unknown, but fixed distribution
$\mathcal{D}$, how quickly does a solution computed on $P$ converge to the
optimal clustering of $\mathcal{D}$? We give several near-optimal results. In
particular:
For center-based objectives, we show a convergence rate of
$\tilde{O}\left(\sqrt{k/n}\right)$. This matches the known optimal bounds
of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016]
and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means
and extends them to other important objectives such as $k$-median.
For subspace clustering with $j$-dimensional subspaces, we show a convergence
rate of $\tilde{O}\left(\sqrt{kj^2/n}\right)$. These are the first
provable bounds for most of these problems. For the specific case of projective
clustering, which generalizes $k$-means, we show that a convergence rate of
$\Omega\left(\sqrt{kj/n}\right)$ is necessary, thereby proving that the
bounds from [Fefferman, Mitter, and Narayanan, Journal of the American
Mathematical Society 2016] are essentially optimal.
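The sample-to-distribution question in this abstract can be illustrated numerically. Below is a minimal sketch in plain NumPy: fit $k$-means (Lloyd's algorithm with farthest-first seeding) on $n$ samples and measure the excess cost of that solution on a large held-out sample standing in for $\mathcal{D}$. The mixture distribution, $k = 3$, and all sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # A fixed but "unknown" distribution D: mixture of three Gaussians.
    means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
    return means[rng.integers(0, 3, n)] + rng.normal(scale=0.5, size=(n, 2))

def init_centers(points, k):
    # Farthest-first seeding: deterministic given the data, avoids bad starts.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None] - np.array(centers)[None], axis=2)
        centers.append(points[d.min(axis=1).argmax()])
    return np.array(centers)

def kmeans(points, k, iters=30):
    # Plain Lloyd iterations (illustrative, not the paper's analysis).
    centers = init_centers(points, k)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return centers

def cost(points, centers):
    # k-means objective: mean squared distance to the nearest center.
    d = np.linalg.norm(points[:, None] - centers[None], axis=2)
    return float((d.min(axis=1) ** 2).mean())

proxy = sample(100_000)              # large sample standing in for D itself
opt = cost(proxy, kmeans(proxy, 3))  # proxy for the optimal cost on D
gaps = []
for n in [50, 500, 5000]:
    centers = kmeans(sample(n), 3)           # solution computed on n samples
    gaps.append(cost(proxy, centers) - opt)  # its excess cost on D
print([round(g, 4) for g in gaps])
```

The printed excess costs shrink as $n$ grows, which is the convergence the learning bounds quantify; the sketch only exhibits the trend and says nothing about the rate.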
The Power of Uniform Sampling for Coresets
Motivated by practical generalizations of the classic $k$-median and
$k$-means objectives, such as clustering with size constraints, fair
clustering, and Wasserstein barycenter, we introduce a meta-theorem for
designing coresets for constrained-clustering problems. The meta-theorem
reduces the task of coreset construction to one on a bounded number of ring
instances with a much-relaxed additive error. This reduction enables us to
construct coresets using uniform sampling, in contrast to the widely-used
importance sampling, and consequently we can easily handle constrained
objectives. Notably and perhaps surprisingly, this simpler sampling scheme can
yield coresets whose size is independent of $n$, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets, for
a large number of constrained clustering problems, including capacitated
clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in
minor-excluded graphs, and polygon clustering under Fr\'{e}chet and Hausdorff
distance. Finally, our technique also yields smaller coresets for $k$-median in
low-dimensional Euclidean spaces.
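The core idea of the second abstract, uniform sampling with reweighting, can be sketched in a few lines: keep $m$ points chosen uniformly at random and weight each by $n/m$, so the weighted cost is an unbiased estimate of the full $k$-median cost. All data, sizes, and the candidate solution below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmedian_cost(points, centers, weight=1.0):
    # k-median objective: (weighted) sum of distances to the nearest center.
    d = np.linalg.norm(points[:, None] - centers[None], axis=2)
    return float(weight * d.min(axis=1).sum())

# Illustrative input: n points in the plane.
n = 20_000
points = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])

# Uniform sample of m points, each carrying weight n/m.
m = 500
coreset = points[rng.choice(n, m, replace=False)]

centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]])  # one candidate solution
full = kmedian_cost(points, centers)
est = kmedian_cost(coreset, centers, weight=n / m)
rel_err = abs(est - full) / full
print(round(rel_err, 3))
```

This only checks a single candidate solution; a true coreset must approximate the cost of every candidate center set simultaneously, which is the harder, worst-case guarantee that the meta-theorem's reduction to ring instances with additive error is designed to deliver.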